EXECUTIVE SUMMARY


Official statisticians have been using a diversity of data sources in the production of 
official statistics for decades, including “designed” data sources such as censuses and 
surveys, and “found” data sources such as administrative and transactional data. 


As citizens interact more and more with digital technologies, and as these technologies
become increasingly capable of leaving digital trails, new sources of data
have emerged and are increasingly available to official statisticians. Such sources
include data from sensor networks and tracking devices (e.g. satellites and mobile
phones), behaviour metrics (e.g. search engine queries), and on-line opinion (e.g. social
media commentaries). The collective term for such data sources is Big Data.


Whilst Big Data have the potential to create a rich, dynamic and focussed picture of 
Australia for informed decision making, and to improve the efficiency in the 
production of official statistics, this paper contends that there are a number of issues 
that an official statistician has to consider before deciding if a particular source from 
Big Data can be used for the regular production of official statistics. 


A principal consideration is business need and business benefit. This includes
assessing whether the new data source will improve the offerings of an existing
statistical series or fill statistical data gaps, e.g. by increasing the frequency of release,
improving the richness of detail such as small area or small population group
statistics, or providing new official statistics that cannot be cost-effectively provided
using existing data sources. It also includes assessment of the business case for using
the new data source, such as whether there will be a reduction in the cost of statistical
production or in provider load, and assessment of the quality of statistics
produced from Big Data, using Data Quality Frameworks, against the benefits to be
provided from the new source.


Another key consideration is the validity of statistical inferences from Big Data.
Depending on the source, Big Data suffer from one or more statistical biases (e.g. coverage bias,
representational bias or self-selection bias) and from measurement errors. Unlike errors
due to sampling, the magnitude of these types of error will not be reduced by
increasing the size of the data set.


The challenge for official statisticians is to develop a suitable methodology for
analysing such data sets so that any conclusions drawn from the analysis are
statistically valid. Firstly, official statisticians need a methodology to address any bias in
Big Data, and secondly, a methodology for using Big Data to produce fit-for-purpose
official statistics.


A Bayesian inference framework is adopted in this paper to assess the conditions 
under which valid statistical inference can be drawn from Big Data. The conditions 
are similar to those for making valid statistical inference from survey data: that any 
underlying process for the inclusion or exclusion of information from the Big Data 
source is independent of that information per se. 


By treating Big Data as auxiliary information, and integrating census and survey data — 
ground truth data — with Big Data, this paper also provides a Bayesian method for 
using new data sources to produce official statistics. For count data, a dynamic logistic 
regression model is used. For continuous data, a dynamic linear model is described. 
The dynamic logistic model is applied to the theoretical analysis of satellite imagery 
data for the prediction of crop growing areas in Australia. 


Other relevant issues for the official statistician to consider when deciding whether a
particular Big Data source is to be used for the production of official statistics
are: privacy and public trust, data ownership and access, computational efficiency and
technology infrastructure.


Until recently, the Australian Bureau of Statistics’ (ABS) progress in the Big Data domain
has consisted primarily of reviewing and monitoring industry developments while
contributing to external strategic and concept development activities. This paper
summarises the ABS Big Data Strategy, whose objective is to build an integrated,
multifaceted capability for systematically exploiting the potential value of Big Data for
official statistics.


This paper also describes the ABS Big Data Flagship Project, which has been 
established to provide the opportunity for the ABS to gain practical experience in 
assessing the business, statistical, technical, computational and other issues related to 
Big Data as outlined earlier in this paper. In addition, ABS participation in national
and international activities on Big Data will help it share experience and knowledge,
and collaboration with academics will help the ABS acquire the capability
to address business problems using Big Data as a part of the statistical solution.
ABSTRACT


Whilst Big Data have the potential to improve statistical production and statistical
offerings, this paper outlines the issues that need to be considered by the official
statistician before a particular Big Data source can be used for the regular production
of official statistics. In addition, the paper outlines Bayesian methods for analysing
satellite imagery data, and the ABS strategies and initiatives on Big Data.


1. INTRODUCTION 


Recent discussions and debates in the public domain about the opportunities
presented by Big Data have now permeated into the sphere of official statistics.
Recent significant events include the discussion of a paper entitled “Big Data and
Modernisation of Statistical Systems” by the United Nations Statistical Commission
(2014), and the adoption of the Scheveningen Memorandum on “Big Data and Official
Statistics” by the Heads of European Statistical Offices (Eurostat, 2013). Whilst
official statisticians have long been using administrative data and business data (one
of the many sources of Big Data) in the production of official statistics, they are
generally, and understandably, cautious about extending this practice to other types
of Big Data.


Almost always, the public discourse about Big Data is Information and Communication 
Technologies (ICT)-centric, and is largely preoccupied with the computing 
infrastructure, systems and techniques needed to effectively and efficiently handle the 
“volume, velocity and variety” of emerging Big Data sources. Translating this into the 
context of official statistics, it is about increasing the technological capability of a 
National Statistical Office (NSO) to capture, store, process and analyse Big Data for 
statistical production. Such debate raises a number of significant questions for official 
statistics, which are outlined below in increasing order of importance. 


Firstly, is “Big Data technology” sufficiently mature to warrant an investment by the 
NSO? The widely-used Gartner Hype Cycle (Rivera and van der Meulen, 2013), which 
assesses the maturity of emerging technologies, places Big Data at the “Peak of 
Inflated Expectations” in 2013. It is considered unlikely that it will reach the “Plateau 
of Productivity” associated with mainstream uptake within the next five years.


Secondly, what is the likely benefit of using Big Data for official statistics, beyond that
of administrative data and some types of business data? While there is undoubtedly 
some value in exploratory analysis of novel Big Data sources for opportunistic use, the 
proposition that a statistical producer should routinely acquire such data sets without 
an explicit business need, and business case, is tantamount to a Big Data solution in 
search of a problem. NSOs, faced with increasing budget pressure, are not willing to 
invest in Big Data unless there is a strong business case for investment. 


Finally, how can Big Data be used to provide reliable and defensible statistical 
outputs? Crawford (2013) argued that “... hidden biases in both the collection and 
analysis stages present considerable risks, and are as important to the Big Data 
equation as the numbers themselves.” The proposition that bigger datasets are 
somehow closer to the “truth” is not accepted by statisticians, since the objective 
“truth” is very much dependent on how representative a particular Big Data source is 
of the underlying population and the nature of the statistical inference drawn from 
such data. Other issues concerning the use of Big Data in Official Statistics are 
outlined in Daas and Puts (2014). 


In spite of these issues and concerns, it is our view that Big Data, Semantic Statistics 
(Clarke and Hamilton, 2014), and Statistical Business Transformation (HLG BAS, 2012;
Pink et al., 2009; and Tam and Gross, 2013) are the three most promising initiatives
for radically transforming the future business model and information footprint of 
NSOs. The Big Data challenge for official statisticians is to discover and exploit those 
non-traditional data sets that can augment or supplant existing sources for the 
efficient and effective production of ‘fit for purpose’ official statistics. Indeed, a 
number of international and national statistical organisations have already started to 
explore the potential for Big Data (United Nations Statistical Commission, 2014; 
Eurostat, 2013). 


The purposes of this paper are to: 


• Highlight some Big Data concepts, and outline concerns about the business value,
methodological soundness, and technological feasibility of utilising Big Data for
official statistics production;


• Provide a preliminary statistical framework for assessing the validity of making
statistical inference for official statistics, and apply this framework to the
analysis of count data from satellite sensing and of magnitude data; and


• Present an outline of the statistical activities being undertaken in the ABS to assess
the business case for using certain types of Big Data to replace or supplement an
existing data source, to create new statistics, or to improve the operational efficiency
of the Australian Bureau of Statistics (ABS).


ABS • BIG DATA, STATISTICAL INFERENCE AND OFFICIAL STATISTICS • 1351.0.55.054


2. DEFINITION, USES AND SOURCES OF BIG DATA


What is Big Data?


According to the Big Data Privacy Report (Podesta et al., 2014),


“... there are many definitions of Big Data, which differ on whether you are a computer
scientist, a financial analyst, or an entrepreneur pitching an idea to a venture capitalist.”


Wikipedia (2014) defines it as


“... a blanket term for any collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools or traditional data
processing applications.”


Big Data is often defined by its characteristics along three dimensions (Daas and Puts, 
2014): 


• Volume: the number of data records, their attributes and linkages;


• Velocity: how fast data are produced and changed, and the speed at which they
must be received, processed and understood; and


• Variety: the diversity of data sources, formats, media and content.


What are the uses of Big Data?


Manyika et al. (2011) argued that


“... there are five broad ways in which using big data can create value. First, big data can
unlock significant value by making information transparent and usable at much higher
frequency. Second, as organizations create and store more transactional data in digital
form, they can collect more accurate and detailed performance information on everything
from product inventories to sick days, and therefore expose variability and boost
performance. ... Third, big data allows ever-narrower segmentation of customers and
therefore much more precisely tailored products or services. Fourth, sophisticated
analytics can substantially improve decision-making. Finally, big data can be used to
improve the development of the next generation of products and services.”


Another potential benefit of Big Data is the provision of more regular and timely
information on interesting patterns, such as early indicators of epidemics (e.g.
Google’s flu indicators, despite their problems), economic upturns or downturns,
unemployment or housing booms, thanks to the lower unit cost of acquiring Big
Data sources compared with the traditional direct data collection methods used by NSOs. An
excellent example of this is provided in Choi and Varian (2011), who used the
term “nowcasting” to describe the process of predicting the present by harnessing
information from Google Trends. In a blog for the Washington Post, Mui (2014)
argued that the currency of statistics afforded by Big Data, which are readily available as a
by-product of other collections, and the way they can be “mined” for interesting patterns,
are promising benefits of Big Data over traditional data sources.


On the other hand, Harford (2014) argued that, whilst unearthing correlation from Big
Data is cheap and easy, correlation, as statisticians have been at pains to point out, is
not the same as causation, and “... a theory-free analysis of mere correlations is
inevitably fragile.”


From the official statistics perspective, Big Data can be defined as statistical data
sources comprising both the traditional sources and new sources that are becoming
available from the “web of everything”. Whilst the volume and velocity of Big Data are
huge and are beyond current data management or processing capabilities, we contend
that NSOs do not necessarily need to use the full data set for the production of official
statistics, as sampling methods may be applied to provide fit-for-purpose statistics.
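
The point that the full data set need not be processed can be illustrated with reservoir sampling, one standard way of drawing a fixed-size simple random sample from a data stream whose length is not known in advance. This is a minimal sketch only: the paper does not prescribe a sampling algorithm, and the function name `reservoir_sample` and the simulated stream of integers are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k items drawn uniformly without replacement from an
    iterable of unknown length, in one pass and O(k) memory (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)        # uniform on 0..i, both ends inclusive
            if j < k:
                reservoir[j] = item      # item i is kept with probability k/(i+1)
    return reservoir


# Hypothetical stream standing in for 200,000 transaction records.
sample = reservoir_sample(range(200_000), k=1000, seed=42)
print(len(sample))
```

Sampling of this kind only addresses volume and velocity. It yields a random sample of the Big Data source itself, so any coverage or selection bias in that source carries over to the sample unchanged.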


Whilst not all varieties of Big Data are suitable for the production of official statistics, they
have the potential to increase the cost efficiency of NSOs, provide new statistical
products and services, and increase the frequency of production of official
statistics at little additional cost to NSOs. Big Data may provide an opportunity for
NSOs to better fulfil their mission of providing official statistics for informed
decision making. However, we contend that decisions on which Big Data source to
use, including decisions on volume, velocity and variety, have to be assessed against
the cost-benefit criteria outlined in the latter part of this paper.


Collectively, the wide variety of extant and emerging Big Data sources of interest to 
official statistics may be broadly categorised as follows (United Nations Statistical 
Commission, 2014): 


• Sources arising from the administration of Government or private sector programs,
e.g. electronic medical records, hospital visits, insurance records and bank records.
Sources from Government programs have traditionally been referred to as
administrative sources by official statisticians;


• Commercial or transactional sources arising from the transaction between two
entities, e.g. credit card transactions and online transactions (including from mobile
devices);


• Sensor network sources, e.g. satellite imaging, road sensors and climate sensors;


• Tracking device sources, e.g. tracking data from mobile telephones and the Global
Positioning System (GPS);


• Behavioural data sources, e.g. online searches (about a product, a service or any
other type of information) and online page views; and


• Opinion data sources, e.g. comments on social media.


Censuses and surveys, together with the first two sources above, i.e. administrative data and, to a
limited extent, business data (e.g. scanner data from supermarkets and motor vehicle
sales data), are currently the principal sources for the production of official
statistics. Big Data open up opportunities for new data sources for NSOs.


While some of the Big Data sources are identifiable (e.g. satellite sensing data with 
pixel longitudes and latitudes), many others are not (e.g. prices of on-line goods and 
services, scanner data, or commercial transactions). Both identifiable and 
unidentifiable data sources have their respective uses in official statistics. For instance, 
satellite sensing data can be combined with data provided by farmers in agricultural 
surveys at the unit record level, whereas on-line price data can be used in calculating
the price relatives for use in Consumer Price Index (CPI) compilations. The challenge
for official statisticians is to find effective and valid ways of utilising the Big Data
sources, where their use in the regular production of official statistics is justified.
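
As a rough illustration of the price relatives mentioned above, the sketch below computes each item's relative (current price divided by base-period price) and combines the relatives with an unweighted geometric mean, often called the Jevons formula. The choice of the Jevons formula is an assumption made for illustration, since the paper does not state which elementary aggregate formula is used, and the item names and prices are invented.

```python
import math


def price_relatives(base_prices, current_prices):
    """Price relative per item: current price divided by base-period price.
    Items missing from either period are skipped."""
    return {item: current_prices[item] / base_prices[item]
            for item in base_prices if item in current_prices}


def jevons_index(relatives):
    """Unweighted geometric mean of the price relatives, with base period = 100."""
    logs = [math.log(r) for r in relatives.values()]
    return 100.0 * math.exp(sum(logs) / len(logs))


# Hypothetical on-line prices collected in a base and a current period.
base = {"milk_1l": 1.50, "bread_loaf": 3.00, "eggs_dozen": 5.00}
current = {"milk_1l": 1.65, "bread_loaf": 3.00, "eggs_dozen": 4.50}

rel = price_relatives(base, current)
index = jevons_index(rel)
print(rel, round(index, 2))
```

Because such price data are not linked to identifiable units, this kind of calculation only needs the same item to be matched across periods.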


3. BIG DATA AND OFFICIAL STATISTICS


Many NSOs in the world, including the ABS (ABS, 2013), have developed significant
and relevant expertise in collecting and processing large amounts of data, and are:


• empowered under their legislation to compel the provision of information by
providers for the purposes of producing official statistics;


• authorised integrators of sensitive data under the statistics legislation;


• uniquely placed, given their holdings of statistical benchmarks, to assess the quality and
“representativeness” of Big Data sources;


• able to produce statistics that are of high quality, so that users can be assured that
the information they are using is ‘fit for purpose’; and


• independent, of high integrity and impartial. Most NSOs publish the concepts,
sources, methods, and results of all collections, and provide “level playing field”
access to all users of official statistics.


Together with the high level of community trust placed in official statistics (see for
example ABS, 2010b), these attributes put many NSOs in a good position to
experiment with, and explore, the potential use of Big Data.


4. OPPORTUNITIES AND CHALLENGES FOR OFFICIAL STATISTICS


To continue to improve their statistical value proposition, many NSOs strive to reduce
the cost of statistical production, improve the timeliness and frequency of their
offerings, and create new or richer statistics that meet emerging statistical data needs.
As part of their business transformation programs to deliver on these aspirations, some
NSOs (e.g. the ABS, Statistics Netherlands and the Italian Statistics Office, to name a few)
are undertaking initiatives to exploit particular Big Data opportunities.


It is our view that a number of applications of Big Data may be identified by drawing 
parallels with the well-established use in official statistics of administrative data, 
provided that the sources meet the benefit criteria and statistical validity issues 
outlined in this paper. These applications include: 


• sample frame or register creation: identifying survey population units and/or
providing auxiliary information such as stratification variables;


• full data substitution: replacing survey collection;


• partial data substitution for a subgroup of a population: reducing sample size;


• partial data substitution for some required data items: reducing survey instrument
length, or enriching the dataset without the need for statistical linking;


• imputation of missing data items: substituting for the same or a similar unit;


• editing: assisting the detection and treatment of anomalies in survey data;


• linking to other data: creating richer datasets and/or longitudinal perspectives;


• data confrontation: ensuring the validity and consistency of survey data; and


• generating new analytical insights: enhancing the measurement and description of
economic, social and environmental phenomena.


While the primary focus is the exploitation of Big Data largely for richer or more 
timely statistical offerings, Big Data generated in-house can also be used to improve 
the efficiency of statistical operations of NSOs (Groves and Heeringa, 2006). These 
include: 


• improving the data provider and data consumer experiences;


• improving operational business efficiencies; and


• monitoring Web and network security and end-user network experiences.


5. BUSINESS BENEFIT


The decision to use a particular Big Data source in statistical production should be
based strictly on business need, and the prospective benefit established on a
case-by-case basis, i.e. how it might improve end-to-end statistical outcomes in terms of
objective cost-benefit criteria. That is, the costs and benefits of using the new data
source need to be assessed in terms of factors such as reduction in provider load,
sustainability of the new source, as well as the accuracy, relevance, consistency,
interpretability, and timeliness of the outputs, as stipulated in Data Quality Frameworks
(ABS, 2010a; Brackstone, 1999; OECD, 2011).


As an example, the full substitution of survey-based data with satellite sensing data for
producing agricultural statistics, such as land cover and crop yield, can be assessed
as follows:


• Costs: What are the likely costs of acquiring, cleaning and preparing the satellite
sensing data in a form suitable for further official processing, noting that the
computational demands of acquiring, transferring, processing, integrating and
analysing large imagery data sets are presently unknown, but are likely to decline
over time? What are the costs of developing a statistical methodology to
transform the satellite sensing data into crop yields, and of developing a
statistical system to process and disseminate satellite sensing data? What are the
equivalent costs for direct data collections, and how do they compare with one
another?


• Reduction in provider load: How much reduction in provider load would result if
direct data collection is replaced by satellite sensing data? How important is it to
have this reduction, based on existing provider experience, and prevailing
Government policy on reduction of regulatory “red tape”? What is the current
degree of cooperation from the farmers and how likely is this to change for the
better, or worse, in the future?


• Sustainability of the statistical outputs: Is the data source available to official
statisticians for the regular production of official statistics? How likely is it that the
source will be discontinued in the future?


• Accuracy, relevance, consistency, interpretability, and timeliness: How does the
new source of data compare with the current source, against the criteria outlined in
Data Quality Frameworks (ABS, 2010a; Brackstone, 1999; OECD, 2011)? Whilst
satellite sensing provides accurate measurements of “reflectance” (measures of the
amount of light reflected by objects), there are missing data from Landsat 7
missions (see below), and from cloud cover. Are these issues bigger, or smaller,
than missing data issues from direct data collections? In addition, transforming
reflectance into crop production statistics requires scientific or statistical modelling,
an endeavour not commonly undertaken by NSOs, which may raise interpretability issues.
As satellite sensing data are available once every fortnight, they clearly have a
distinct advantage over annual or sub-annual direct data collections in terms of the
frequency of availability of crop yield statistics.


6. VALIDITY OF STATISTICAL INFERENCE


Data sets derived from Big Data sources are not necessarily random samples of the
target population. The design-based statistical inferences adopted by most NSOs for
estimating finite population parameters such as population means, totals, and
quantiles rely either on random samples, i.e. samples for which the selection mechanism does not depend on
the values of the units not selected in the sample (Särndal, Swensson and Wretman,
1977; Kish, 1965), or on statistical models that adjust for the selection bias from
non-random samples (Puza and O’Neill, 2006).


As an example, social media services (such as Twitter) are a rich data source for the 
measurement of public opinion. However, there is little verifiable information about 
the users of these services, and it is difficult to determine whether the user profiles 
are “representative” of the population in general. In fact, it is to be expected that 
some population subgroups will be under-represented in any sample of social media 
data, due to the differential adoption rate of new technologies. Where people’s
(self-)selection into, or out of, social media depends on their opinions, estimates
of population opinion from such sources, without proper modelling and adjustment,
are subject to bias (Smith, 1983).
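
The effect described above can be sketched with a small simulation. It assumes, purely for illustration, a population in which half hold a given opinion, and in which holders of the opinion join a platform with probability 0.6 and non-holders with probability 0.3. Under those assumed figures the share of opinion holders among platform users is 0.3 / 0.45 = 2/3 regardless of how many users there are. The function name and all parameter values are hypothetical.

```python
import random


def opinion_estimates(n_population, p_true=0.5, p_join_yes=0.6,
                      p_join_no=0.3, seed=0):
    """Compare a self-selected 'Big Data' estimate with a small random sample."""
    rng = random.Random(seed)
    opinions = [rng.random() < p_true for _ in range(n_population)]
    # Self-selection: the chance of joining depends on the opinion itself.
    joined = [o for o in opinions
              if rng.random() < (p_join_yes if o else p_join_no)]
    # Simple random sample of only 1000 people from the same population.
    srs = rng.sample(opinions, 1000)
    return sum(joined) / len(joined), sum(srs) / len(srs), len(joined)


for n in (10_000, 200_000):
    big_est, srs_est, n_big = opinion_estimates(n)
    print(f"population={n:>7}  self-selected n={n_big:>6}  "
          f"self-selected estimate={big_est:.3f}  SRS estimate={srs_est:.3f}")
```

Increasing the population, and hence the number of self-selected records, makes the self-selected estimate more precise but leaves it centred on the wrong value, whereas the much smaller random sample remains centred on the true proportion.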


In general, being custodians of a large number and variety of statistical benchmarks,
NSOs are uniquely positioned to assess the representativeness of the underlying
population of Big Data. In some cases the Big Data might need to be supplemented
with survey data to obtain coverage of unrepresented segments of the population. In
other cases it may be useful to publish statistics that describe sub-populations. A
related issue is that the statistical analysis of large, complex, heterogeneous datasets
will inevitably yield significantly more spurious model-dependent correlations than
would be expected from traditional data sources. This can accentuate any
modelling bias by reinforcing the selection of the wrong variables, algorithms and
metrics of fitness.
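
The growth in spurious correlations can be illustrated with pure noise. In the sketch below, an outcome with 50 observations is compared with a number of candidate variables that are, by construction, unrelated to it; the largest absolute sample correlation found increases as more candidates are searched. The sample size, the numbers of candidates and the function names are arbitrary choices for illustration.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_abs_corr(n_obs, n_vars, seed=0):
    """Largest |correlation| between a noise outcome and n_vars noise variables."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_vars):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        best = max(best, abs(pearson(x, y)))
    return best


few = max_abs_corr(n_obs=50, n_vars=5)
many = max_abs_corr(n_obs=50, n_vars=5000)
print(f"best of 5 candidates: {few:.2f}   best of 5000 candidates: {many:.2f}")
```

Every correlation found here is spurious, which is why variable selection driven by fit alone becomes less trustworthy as the number of available variables grows.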


As an example, Google Flu Trends — which uses the number of online searches as a 
measure of the prevalence of flu in the general population — mistakenly estimated that 
peak flu levels reached 11% of the U.S. public in the 2012 flu season. This was almost 
double the official estimate of 6% published by public health officials. Google Trends 
explained the over-estimation by “... heightened media coverage on the severity of 
the flu season resulted in an extended period in which users were searching for terms 
we've identified as correlated with flu levels” (Google Trends, 2013). This highlights 
the importance of assessing under what conditions, and for what applications, the use
of Big Data requires adjustment, in order to provide statistical
estimates that are of the same level of quality as the official statistics regularly
published by the NSOs.


6.1 A Theory for Big Data statistical inference


Couper (2013) outlined some significant issues the analyst has to consider when
making inferences using Big Data, including coverage bias, selection
(representational) bias, measurement bias and response bias. This section attempts to
provide a framework for considering such issues. It is useful to conceptualise the
following elements of an inference framework for Big Data:


1. Target population, $U$ (of size $u$): the population of interest to the NSOs on which
statistical inferences are to be made. In the Twitter example, this may be the
population of Australia aged 15 years and above. In the remote sensing example,
this may be the agricultural land parcels of Australia;


2. Big Data population, $U_B$ (of size $b$): the actual population included in the Big
Data. In the Twitter example, this will be the registered Twitter users. For the
remote sensing example, this can be the land parcels of Australia. If the coverage of
$U_B$ is not the same as the coverage of $U$, inference based on $U_B$ will suffer from
coverage bias. For the rest of this paper, we assume that the coverage of $U_B$ is a
subset of $U$, and conceptualise $U_B$ as a sample (random or otherwise) from $U$ (see
point 6 below), with the coverage bias to be addressed through statistical modelling
of the missing data process “R” (see point 7 below);


3. Vector of measurements of interest to the NSO, $M_U$. This could be consumer
confidence or crop yields;


4. Vector of proxy measurements available from Big Data, $Z_B$. This provides the
proxy variables, or covariates, to be used to predict $M_U$. From points 1 and 2
above, we can consider $Z_B$ as a sample (random or otherwise) of measurements
from $U$ to predict $M_U$. In the Twitter example, $Z_B$ could be the sentiment data to
predict consumer confidence, $M_U$. In the remote sensing example, $Z_B$ comprises
reflectance measurements from selected wavebands captured by remote sensing
missions, for discrete pixels of sizes ranging between 10 m² and 1 km², to predict the
annual production of certain types of crops, $M_U$;


5. A transformation (or measurement) process, “T”, is generally required to transform
the data, $Z_B$, to the measurements of interest, $M_U$. In the remote sensing
example, this may be transforming the observed reflectance in selected wavebands
captured by the remote sensing mission into the crop types (see Section 6.2 for an
example of the “T” process). This is generally a complex scientific or statistical
modelling process requiring detailed understanding of the reflectance
characteristics of the different ground cover types, which in turn are dependent on
the selective spectral absorption characteristics associated with their biophysical
and biochemical compositions (Richards, 2013, p. 12);


6. A sampling process, random or otherwise, “I”, is used to conceptualise the
selection of $U_B$ from $U$. In many Big Data examples, “I” is unknown, and requires
in-depth contextual knowledge to develop proper statistical models to represent it,
if this is possible at all. Depending on the type of the Big Data source, this is not generally a
straightforward process. However, with remote sensing data from the Landsat
satellite series (Landsat, 2013), one has the fortunate situation that the coverage of
$U$ and that of $U_B$ are identical, making the “I” process superfluous in this case;


7. A censoring (missing data) process, “R”, which renders parts of the vector, $Z_B$, not
available. Where the coverage of $U_B$ is not the same as $U$, one could conceptualise
an “R” process in play rendering observations in the target population, $U$, missing.
Another instance of missing data will be incomplete observations from $Z_B$. In the
remote sensing example, missing data could be due to bad weather, or something
more systemic (see Section 6.2 below). For this reason, whilst the “I” process can be
subsumed in the “R” process, for the purpose of this paper, we conceptualise them
as two separate processes. We will use the notation $Z_{Bo}$ to represent the
observed covariates or proxy variables from Big Data.


For finite population inferences, we are interested in predicting $g(M_U)$, where $g(\cdot)$
denotes a linear or non-linear function. The data that we have for the inference is
$Z_{Bo}$ (which “survives” the selection and censoring processes I and R).
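
The relationship between the proxy values for the whole target population, the “I” and “R” processes, and the observed data can be sketched as follows. This is a toy illustration only: the population size, the inclusion and response probabilities, and the variable names `Z_U`, `selected`, `responded`, `Z_Bo` and `Z_C` are assumptions chosen to mirror the notation of this section, not part of the paper's method.

```python
import random

rng = random.Random(7)
u = 10  # size of the target population U (tiny, for illustration)

# Hypothetical proxy value (e.g. a reflectance reading) for every unit of U.
Z_U = [round(rng.gauss(0.5, 0.2), 3) for _ in range(u)]

# "I" process: which units of U are included in the Big Data population U_B.
selected = [rng.random() < 0.7 for _ in range(u)]
# "R" process: which units have a reading that is not censored (missing).
responded = [rng.random() < 0.8 for _ in range(u)]

# Observed proxies: values that survive both the I and the R process.
Z_Bo = {i: Z_U[i] for i in range(u) if selected[i] and responded[i]}
# Remainder of Z_U, never seen by the analyst.
Z_C = {i: Z_U[i] for i in range(u) if not (selected[i] and responded[i])}

print("observed units:  ", sorted(Z_Bo))
print("unobserved units:", sorted(Z_C))
```

In this toy set-up both indicators are drawn independently of the values themselves, which is the favourable situation; the inferential difficulty arises when inclusion or censoring depends on the quantities being measured.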


We denote by $f(\cdot)$ the probability density function (pdf). We assume the pdf,
$f(Z_U; \theta)$, indexed by the unknown parameter, $\theta$, with known prior distribution
$f(\theta)$, is known. Predicting $M_U$ from $Z_U$ will generally be a scientific process, for
example, converting remote sensing data, $Z_U$, into crop yield data, $M_U$. For the
purpose of this paper, we also make the assumption that the pdf $f(M_U \mid Z_U; \phi)$ is
known, as is the prior $f(\phi)$ of the parameter $\phi$. It is further assumed that the
parameters, $\phi$ and $\theta$, are distinct (Rubin, 1976).


Following Rubin (1976), Little (1982), Little (1983) and Smith (1983), the task of the
statistician, using a Bayesian inference framework, is to calculate the posterior
distribution $f(M_U \mid Z_{Bo}, I, R)$, or simply $[M_U \mid Z_{Bo}, I, R]$ to simplify the notation.
Writing $Z_U = (Z_{Bo}, Z_C)$ to split up the $Z_U$ variables of the target population into
those from the Big Data and the remainder, we have


[My |Zgo.LR] & ff [Mu.Zpo.ZcLR,0,9] d0dodzZ 


= fff [RIMy.Zp0.Zc.1,6,@ | [l|My, Zgo:Zc, 8 @ | [Muy,Zp0>Zc,8,9] d0 de dZ¢ 


x f[f [Mu.Zpo.Zc,90] dadodzZ (1) 
~ [Mu Zpo | 
x [My |Zpo | (2) 
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provided that the following ignorability conditions (1) for sampling and censoring are 
satisfied: 


[R |My, Zpo.Zc,1,0,9] = [R|My,Zpo.1| 
and [I |My, Zpo.2c,89] = [1|Zpo |. 


In other words, subject to the fulfilment of these conditions, the scientific process to
translate the Big Data observations $Z_{Bo}$ into the measurements of interest $M_U$ can be
performed by disregarding the sampling and censoring processes.


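Now

$$
\begin{aligned}
[M_U \mid Z_{Bo}] &\propto \iiint [M_U, Z_{Bo}, Z_C, \theta, \phi] \, d\theta \, d\phi \, dZ_C \\
&= \iiint [M_U, Z_U, \theta, \phi] \, d\theta \, d\phi \, dZ_C \\
&= \iiint [M_U \mid Z_U, \phi] \, [\phi] \, [Z_U \mid \theta] \, [\theta] \, d\theta \, d\phi \, dZ_C \\
&= \int E_\phi(M_U \mid Z_U, \phi) \, E_\theta(Z_U \mid \theta) \, dZ_C \qquad (3)
\end{aligned}
$$

where $E_\phi(\cdot)$ and $E_\theta(\cdot)$ denote the expectation with respect to $f(\phi)$ and $f(\theta)$
respectively.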


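For analytic inferences, the interest will be on estimating the parameter $\phi$ of the pdf,
$f(M_U \mid Z_U; \phi)$. Now similar to the derivation of (1),

$$
\begin{aligned}
[\phi \mid Z_{Bo}, I, R] &\propto [\phi, Z_{Bo}, I, R] \\
&\propto \iint [Z_{Bo}, Z_C, I, R, \theta, \phi] \, d\theta \, dZ_C \\
&= \iint [R \mid Z_{Bo}, Z_C, I, \theta, \phi] \, [I \mid Z_{Bo}, Z_C, \theta, \phi] \, [Z_{Bo}, Z_C, \theta, \phi] \, d\theta \, dZ_C \\
&\propto \iint [Z_{Bo}, Z_C, \theta, \phi] \, d\theta \, dZ_C \qquad (4) \\
&\propto [\phi \mid Z_{Bo}] \qquad (5)
\end{aligned}
$$

provided that the following ignorability conditions (II) are satisfied:

$$[R \mid Z_{Bo}, Z_C, I, \theta, \phi] = [R \mid Z_{Bo}, I]$$

and

$$[I \mid Z_{Bo}, Z_C, \theta, \phi] = [I \mid Z_{Bo}].$$

Likewise,

$$
\begin{aligned}
[\phi \mid Z_{Bo}] &\propto \iiint [M_U, Z_U, \theta, \phi] \, d\theta \, dZ_C \, dM_U \\
&= \iiint [M_U \mid Z_U, \phi] \, [\phi] \, [Z_U \mid \theta] \, [\theta] \, d\theta \, dZ_C \, dM_U \\
&= [\phi] \iint [M_U \mid Z_U, \phi] \, E_\theta(Z_U \mid \theta) \, dZ_C \, dM_U. \qquad (6)
\end{aligned}
$$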


The ignorability conditions (I) and (II) are also known as Missing At Random
conditions (Rubin, 1976). The inference framework outlined in this paper is similar to
the one described in Wikle et al. (1998) for predicting temperature data, but is
adapted to official statistics, as well as extended to address missing data from the “R”
process, and the sampling process, “I”.
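
The role of the condition on the selection process “I” can be illustrated with a small simulation. The sketch below is illustrative only and is not taken from the paper: the linear relationship between a scalar covariate z (standing in for the Big Data covariate) and a measurement m (standing in for the measurement of interest), the two inclusion rules, and the function names `ols_slope` and `simulate` are all assumptions. When inclusion depends only on the observed covariate, the relationship estimated from the included units stays close to the population one; when inclusion depends on the measurement itself, it does not.

```python
import random


def ols_slope(pairs):
    """Least-squares slope of m on z for a list of (z, m) pairs."""
    n = len(pairs)
    z_bar = sum(z for z, _ in pairs) / n
    m_bar = sum(m for _, m in pairs) / n
    sxy = sum((z - z_bar) * (m - m_bar) for z, m in pairs)
    sxx = sum((z - z_bar) ** 2 for z, _ in pairs)
    return sxy / sxx


def simulate(n=20000, true_slope=2.0, seed=1):
    """Return the fitted slope under an ignorable and a non-ignorable inclusion rule."""
    rng = random.Random(seed)
    population = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)                    # covariate, always observable
        m = true_slope * z + rng.gauss(0.0, 1.0)   # measurement of interest
        population.append((z, m))

    # Hypothetical ignorable rule: inclusion depends on the covariate only.
    ignorable = [(z, m) for z, m in population if z > 0.0]
    # Hypothetical non-ignorable rule: inclusion depends on the measurement itself.
    non_ignorable = [(z, m) for z, m in population if m > 0.0]

    return ols_slope(ignorable), ols_slope(non_ignorable)
```

Calling `simulate()` returns two slopes: the first should be close to the assumed population value of 2, while the second should be visibly attenuated, which is the kind of distortion that a model of the selection process would have to correct.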


Whilst the ignorability conditions may not be fulfilled, if NSOs have access to other
variables, $X_B$, such that $[I, R \mid M_U, X_B, Z_{Bo}, Z_C, \theta, \phi] = [I, R \mid X_B, Z_{Bo}]$ is fulfilled,
then $[M_U \mid X_B, Z_{Bo}, I, R] = [M_U \mid X_B, Z_{Bo}]$ (Tam, 2014).


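Often, NSOs will have some information on $M_U$ (say, $M_{so}$) from sample surveys or
administrative data sources, where “so” denotes the observed value from such
sources. Conceptually, we assume that a sampling process, $\tilde{I}$, and a censoring
process, $\tilde{R}$, are at play, where $\tilde{I}$ and $\tilde{R}$ are considered to be independent of the
I and R processes.


In this case, inference on $M_U$ should be based on $[M_U \mid M_{so}, Z_{Bo}, \tilde{I}, \tilde{R}, I, R]$.
Now,

$$
\begin{aligned}
[M_U \mid M_{so}, Z_{Bo}, \tilde{I}, \tilde{R}, I, R]
&= \frac{[\tilde{I}, \tilde{R} \mid M_U, M_{so}, Z_{Bo}, I, R] \, [M_U, M_{so}, Z_{Bo}, I, R]}
        {[\tilde{I}, \tilde{R} \mid M_{so}, Z_{Bo}, I, R] \, [M_{so}, Z_{Bo}, I, R]} \\
&= [M_U \mid M_{so}, Z_{Bo}, I, R]
\end{aligned}
$$

provided that

$$[\tilde{I}, \tilde{R} \mid M_U, M_{so}, Z_{Bo}, I, R] = [\tilde{I}, \tilde{R} \mid M_{so}, Z_{Bo}, I, R],$$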


i.e. $\tilde{I}$ and $\tilde{R}$ do not depend on the unsampled or unobserved values
of $M_U$: ignorability condition (III). This is a weaker condition than Ignorability
Condition (I), since design-based sampling conditions are generally used in NSOs, and
the incidence of non-response is low for mandatory surveys. This is certainly the case
in the ABS.


Example 1


(Puza, 2013) Assume the simple case where $M_U = M_B$, $Z_{Bo} = Z_B$, and the
sampling and censoring processes ($\tilde{I}$, $I$, $\tilde{R}$ and $R$) are ignorable. Under the models:


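$$
\begin{aligned}
M_B \mid Z_B, \beta &\sim N(Z_B \beta, \Sigma) \\
\beta \mid \beta_0, \Omega &\sim N(\beta_0, \Omega)
\end{aligned}
$$

we have $[M_U \mid M_{so}, Z_B] = [M_C \mid M_{so}, Z_B] = N(\mu, \Psi)$
and $[\beta \mid M_{so}, Z_B] = N(\hat{\beta}, D)$
where

$$
\begin{aligned}
\mu &= Z_r \hat{\beta} + \Sigma_{rso} \Sigma_{soso}^{-1} (M_{so} - Z_{so} \hat{\beta}), \\
\Psi &= \Sigma_{rr} - \Sigma_{rso} \Sigma_{soso}^{-1} \Sigma_{rso}'
        + (Z_r - \Sigma_{rso} \Sigma_{soso}^{-1} Z_{so}) \, D \, (Z_r - \Sigma_{rso} \Sigma_{soso}^{-1} Z_{so})', \\
\hat{\beta} &= D (\Omega^{-1} \beta_0 + Z_{so}' \Sigma_{soso}^{-1} M_{so}), \\
D &= (\Omega^{-1} + Z_{so}' \Sigma_{soso}^{-1} Z_{so})^{-1}, \\
M_B' &= (M_{so}', M_C'), \\
Z_B' &= (Z_{so}', Z_r'), \\
\Sigma &= \begin{pmatrix} \Sigma_{soso} & \Sigma_{sor} \\ \Sigma_{rso} & \Sigma_{rr} \end{pmatrix}.
\end{aligned}
$$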
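
The formulas of Example 1 can be sketched numerically as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `predict_unobserved`, the argument names, and the convention that the covariance matrix is ordered with the observed units first and the unobserved units last are all hypothetical choices made for the example.

```python
import numpy as np


def predict_unobserved(Z_so, Z_r, M_so, Sigma, beta0, Omega):
    """Posterior of beta and predictive distribution of the unobserved M_C.

    Assumes Sigma is the covariance of (M_so, M_C) stacked in that order.
    Returns (mu, Psi, beta_hat, D) in the notation of Example 1.
    """
    n = Z_so.shape[0]
    S_ss = Sigma[:n, :n]     # Sigma_soso
    S_rs = Sigma[n:, :n]     # Sigma_rso
    S_rr = Sigma[n:, n:]     # Sigma_rr
    S_ss_inv = np.linalg.inv(S_ss)
    Omega_inv = np.linalg.inv(Omega)

    # Posterior covariance and mean of beta given the observed data.
    D = np.linalg.inv(Omega_inv + Z_so.T @ S_ss_inv @ Z_so)
    beta_hat = D @ (Omega_inv @ beta0 + Z_so.T @ S_ss_inv @ M_so)

    # Predictive mean and covariance of the unobserved units.
    A = S_rs @ S_ss_inv
    B = Z_r - A @ Z_so
    mu = Z_r @ beta_hat + A @ (M_so - Z_so @ beta_hat)
    Psi = S_rr - A @ S_rs.T + B @ D @ B.T
    return mu, Psi, beta_hat, D
```

One way to check such a sketch is to compare its output with the conditional distribution obtained directly from the marginal model in which the observed and unobserved units are jointly normal with mean $Z_B \beta_0$ and covariance $Z_B \Omega Z_B' + \Sigma$; the two routes should agree.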


The above theory shows that for proper finite population and analytic inferences, 
equations (1) and (4) should be used. In general, in most Big Data applications the 
specification of the censoring model (e.g. how censoring is dependent on the 
unobserved data in the target population but not in the Big Data population) and the 
sampling model (e.g. how sampling is dependent on unobserved measurements or 
proxy measurement) can be subjective and difficult to specify, although we note that 
there is a vast body of statistical literature to address non-ignorable situations (Puza 
and O’Neil, 2006; Heckman, 1979; Little, 1982; Little, 1983; Little and Rubin, 2002; 
Madow, Olkin and Rubin, 1983; Smith, 1983 and Wu, 2010 are just some examples).
The challenge for the official statistician is to find and use models that meet the 
integrity requirements of official statistics. Where information available to the official 
statistician suggests that the ignorability conditions are fulfilled, then analyses of Big 
Data can proceed as if it is a random sample from the target population.


6.2 Analysis of satellite sensing data


For remote sensing data, let $M_U$ represent, for simplicity, the vector of binary
variables for all the pixels in $U$, with a value of 1 assigned for the pixel that contains a
certain crop of interest to the NSO, and 0 otherwise. Let $Z_U$ be a matrix comprising
row vectors of remote sensing data (consisting of reflectance measurements from the
satellite on-board sensors) and a column vector of “ones”.


As the full data $Z_U$ is available from Landsat (Landsat, 2013), $Z_B = Z_U$. That is, there
is no sampling involved, so the first requirement of ignorability conditions (I) is
satisfied. However, when there is missing data, then the second requirement of
ignorability conditions (I) needs to be checked. Where missing data is due to random
bad weather, it may be safe to assume that the missing data is not associated with the
reflectance measurements, and, if so, we may treat the resultant dataset as a random
sample. In the case where missing data is due to systemic effects (such as the
problems that occurred in May 2003, which caused approximately 22% missingness of
the Landsat 7 imagery data that had to be replaced by other data), an assessment is
required on whether it is acceptable to assume the observed data set comprises a
random sample.


An interest will be to use the reflectance measurements to predict the total yield of
the particular crop in question, which requires the specification of a “T” process. A
review of the remote sensed information models and crop models for this “T” process
is provided in Delécolle et al. (1992). Alternatively, a statistical model like a logistic
regression model may be used, provided that the NSO has information on $M_{so}$,
available from agricultural censuses or surveys. For the simple case where $M_U = M_B$,
$Z_{Bo} = Z_B$, the sampling and censoring/response processes ($\tilde{I}$, $I$, $\tilde{R}$ and $R$) are
ignorable, and letting “a” denote the size of a pixel, then the estimated yield of a
certain type of crop will simply be $1_C' M_C a + 1_{so}' M_{so} a$. The statistical task is to
predict $M_C$, given $M_{so}$ and $Z_B$.


Insofar as the above formulation does not take account of previously available
information, e.g. $Z_B$ and $M_{so}$ from earlier time points, the analysis is not optimal. A
proper Bayesian analysis must take account of this information. To address this, we
introduce the following notations and models.


To simplify notations, we shall drop the subscript “B” in the sequel. We shall also 
introduce a “t” subscript to denote time. Let 


• $M_t$ denote the $b \times 1$ vector $M_{tB} = \{m_{t1}, \ldots, m_{tb}\}'$;

• $m_{ti}$ be a Random Variable (to be defined in Example 2 below) for pixel $i$, where
$i = 1, \ldots, b$;

• $Z_{ti}$ be the $p \times 1$ vector of reflectance and an intercept for pixel $i$;

• $\beta_t$ be the $p \times 1$ vector of unknown coefficients;

• $\sigma(Z_{ti}' \beta_t) = (1 + e^{-Z_{ti}' \beta_t})^{-1}$ be the logistic sigmoid for pixel $i$;

• $\sigma(Z_t \beta_t)$ be the $b \times 1$ vector $\{\sigma(Z_{t1}' \beta_t), \ldots, \sigma(Z_{tb}' \beta_t)\}'$;

• $M_{ts} = M_{tso}$ be the $n_t \times 1$ vector of “ones” or “zeros”, where “one” is recorded if the
pixel grows the crop of interest to the official statistician, and zero otherwise. This
information is obtained by “ground truthing” and is also referred to as the training
dataset in the Machine Learning literature;

• $M_{tC}$ ($N_t \times 1$) is defined by the equation $M_t' = \{M_{tC}', M_{ts}'\}$, where $b = N_t + n_t$;

• $Z_t$ be the $b \times p$ matrix of all reflectance and a column of ones available from
satellite sensing data;

• $Z_{ts}$ be the $n_t \times p$ matrix of reflectance and a column of ones corresponding to the
training dataset;

• $M^{(t)} = \{M_{1s}, \ldots, M_{ts}\}$;

• $Z^{(t)} = \{Z_1, \ldots, Z_t\}$; and

• $D^{(t)} = \{M^{(t)}, Z^{(t)}\}$.


Example 2 


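Under the models:

$$
\begin{aligned}
m_{ti} &\sim \text{independent Binomial Logistic}\left(\sigma(Z_{ti}' \beta_t)\right) \\
\beta_t &= \beta_{t-1} + \varepsilon_t, \qquad \beta_{t-1} \perp Z_t \\
\varepsilon_t &\sim \text{independent } N(0, \Omega_t), \qquad \varepsilon_t \perp D^{(t-1)}
\end{aligned}
$$

for $i = 1, \ldots, b$, and $\Omega_t$ known for $t = 1, \ldots, \tau$, then

$$m_{ti} \mid D^{(t)} \approx \text{independent Binomial Logistic}\left(\sigma(Z_{ti}' \hat{\beta}_{t|t})\right)$$

for $i = 1, \ldots, b$ and $t = 1, \ldots, \tau$, where $\hat{\beta}_{t|t}$ is the maximum likelihood estimator of the
posterior distribution $[\beta_t \mid D^{(t)}]$ (that is, its mode), and $\Sigma_{t|t}$ is the inverse of the
negative of the Hessian of the log posterior distribution evaluated at $\hat{\beta}_{t|t}$, given by:

$$\hat{\beta}_{t|t} = \hat{\beta}_{t|t-1} + \Sigma_{t|t-1} \left\{ Z_{ts}' M_{ts} - Z_{ts}' \sigma(Z_{ts} \hat{\beta}_{t|t}) \right\} \qquad (7)$$

and

$$\Sigma_{t|t-1} = \Sigma_{t-1|t-1} + \Omega_t. \qquad (8)$$

In addition,

$$[\beta_t \mid D^{(t)}] \approx N(\hat{\beta}_{t|t}, \Sigma_{t|t})$$

where $\Sigma_{t|t} = \left( Z_{ts}' W_{ts}(\beta_t) Z_{ts} + \Sigma_{t|t-1}^{-1} \right)^{-1}$
and $\Sigma_{t|t}$ is evaluated at $\beta_t = \hat{\beta}_{t|t}$.

Finally, the posterior mean of $1_{tC}' M_{tC} a + 1_{ts}' M_{ts} a$ is
$1_{tC}' \sigma(Z_{tC} \hat{\beta}_{t|t}) a + 1_{ts}' M_{ts} a$,
and the posterior variance is $1_{tC}' W_{tC}(\hat{\beta}_{t|t}) 1_{tC} a^2$,
where $W_{tC}(\hat{\beta}_{t|t})$ is an $N_t \times N_t$ diagonal matrix, with diagonal elements
$\sigma(Z_{ti}' \hat{\beta}_{t|t}) (1 - \sigma(Z_{ti}' \hat{\beta}_{t|t}))$ for $i = 1, \ldots, N_t$.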
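
The final step of Example 2, turning fitted pixel probabilities into an estimated crop total, can be sketched as below. This is an illustrative sketch only: the function names and the list-based data layout are assumptions, the coefficient vector is taken as already estimated, and the variance is scaled by the squared pixel size on the basis that the total is the pixel size times a sum of independent binary variables.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def crop_area(Z_c, beta_hat, M_s, pixel_area):
    """Posterior mean and variance of the total crop area.

    Z_c        rows of covariates (reflectance plus intercept) for unlabelled pixels
    beta_hat   fitted coefficient vector (assumed to be already estimated)
    M_s        0/1 ground-truth labels for the training pixels
    pixel_area size "a" of one pixel
    """
    probs = [sigmoid(sum(z * b for z, b in zip(row, beta_hat))) for row in Z_c]
    # Labelled pixels contribute their known value; unlabelled pixels their probability.
    mean = pixel_area * (sum(probs) + sum(M_s))
    # Only the unlabelled pixels contribute uncertainty.
    variance = pixel_area ** 2 * sum(p * (1.0 - p) for p in probs)
    return mean, variance
```

For instance, two unlabelled pixels with fitted probability 0.5 each, three labelled pixels of which two grow the crop, and a pixel size of 2 give a mean of 6 and a variance of 2.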


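To prove the results, we first let $[\beta_t \mid D^{(t)}] = G(\beta_t)$, say. Using the Taylor series for
vectors, it can be shown that

$$[\beta_t \mid D^{(t)}] \approx N(\hat{\beta}_{t|t}, \Sigma_{t|t})$$

where $\hat{\beta}_{t|t} = \arg\max_{\beta_t} \ln G(\beta_t)$

and $\Sigma_{t|t} = \left\{ -\nabla^2 \ln G(\beta_t) \right\}^{-1}$ evaluated at $\hat{\beta}_{t|t}$,

where $\nabla^2 \ln G(\beta_t)$ denotes the second derivative of $\ln G(\beta_t)$ with respect to $\beta_t$.


From

$$
\begin{aligned}
[M_t \mid D^{(t)}] &= [M_{tC} \mid D^{(t)}] \\
&= \int [M_{tC} \mid D^{(t)}, \beta_t] \, [\beta_t \mid D^{(t)}] \, d\beta_t \\
&= \int \prod_i \sigma(Z_{ti}' \beta_t)^{m_{ti}} \left(1 - \sigma(Z_{ti}' \beta_t)\right)^{1 - m_{ti}} [\beta_t \mid D^{(t)}] \, d\beta_t \\
&\approx \prod_i \sigma(Z_{ti}' \hat{\beta}_{t|t})^{m_{ti}} \left(1 - \sigma(Z_{ti}' \hat{\beta}_{t|t})\right)^{1 - m_{ti}}
\end{aligned}
$$

using the standard form for Laplace approximation for integrals (Tierney et al., 1989),
given that $[\beta_t \mid D^{(t)}] \approx N(\hat{\beta}_{t|t}, \Sigma_{t|t})$.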


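This proves the first part of the result of Example 2.


Now

$$[\beta_t \mid D^{(t)}] = G(\beta_t) \propto [M_{ts} \mid Z_t, D^{(t-1)}, \beta_t] \, [\beta_t \mid Z_t, D^{(t-1)}].$$

From the model assumptions of $\beta_t$, it follows that

$$
\begin{aligned}
[\beta_t \mid Z_t, D^{(t-1)}] &= [\beta_t \mid D^{(t-1)}] \\
&= \int [\beta_t \mid D^{(t-1)}, \beta_{t-1}] \, [\beta_{t-1} \mid D^{(t-1)}] \, d\beta_{t-1} \\
&= \int [\beta_t \mid \beta_{t-1}] \, [\beta_{t-1} \mid D^{(t-1)}] \, d\beta_{t-1} \\
&= N(\hat{\beta}_{t|t-1}, \Sigma_{t|t-1}) \qquad (9)
\end{aligned}
$$

where

$$
\begin{aligned}
\hat{\beta}_{t|t-1} &= E[\beta_t \mid Z_t, D^{(t-1)}] \\
&= E[\beta_{t-1} \mid Z_t, D^{(t-1)}] + E[\varepsilon_t \mid Z_t, D^{(t-1)}] \\
&= E[\beta_{t-1} \mid D^{(t-1)}] \\
&= \hat{\beta}_{t-1|t-1} \qquad (10)
\end{aligned}
$$

and

$$
\begin{aligned}
\Sigma_{t|t-1} &= V[\beta_{t-1} \mid Z_t, D^{(t-1)}] + V[\varepsilon_t \mid Z_t, D^{(t-1)}] \\
&= \Sigma_{t-1|t-1} + \Omega_t. \qquad (11)
\end{aligned}
$$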


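Now (7) follows by noting that

$$
\begin{aligned}
\nabla \ln G(\beta_t) &= \nabla \ln [M_{ts} \mid Z_t, D^{(t-1)}, \beta_t] + \nabla \ln [\beta_t \mid Z_t, D^{(t-1)}] \\
&= Z_{ts}' M_{ts} - Z_{ts}' \sigma(Z_{ts} \beta_t) - \Sigma_{t|t-1}^{-1} (\beta_t - \hat{\beta}_{t|t-1}) \qquad (12)
\end{aligned}
$$

with the well-known first term of (12) derived in maximum likelihood estimation for
logistic regression models (see, for example, Czepiel (2002)), and the second term of
(12) follows from the multivariate normality of $\beta_t$; and (8) follows from noting that

$$
\begin{aligned}
\Sigma_{t|t}^{-1} &= -\nabla^2 \ln [M_{ts} \mid Z_t, D^{(t-1)}, \beta_t] - \nabla^2 \ln [\beta_t \mid Z_t, D^{(t-1)}] \\
&= Z_{ts}' W_{ts}(\beta_t) Z_{ts} + \Sigma_{t|t-1}^{-1}
\end{aligned}
$$

where $W_{ts}(\beta_t)$ is an $n_t \times n_t$ diagonal matrix, with diagonal elements
$\sigma(Z_{ti}' \beta_t) \{1 - \sigma(Z_{ti}' \beta_t)\}$ for $i = 1, \ldots, n_t$.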


Comment 1. The restrictive assumption that $\Omega_t$ is known for all t can be relaxed by,
for example, assuming

$$\Omega_t = (1 - \lambda_t) \, \Sigma_{t-1|t-1} / \lambda_t$$

for scalars $\lambda_t$ such that $\lambda_t$ maximises $\int G(\beta_t) \, d\beta_t$ (McCormick et al., 2012).


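Comment 2. Whilst (7) is of interest in its own right, it cannot be used directly to
calculate $\hat{\beta}_{t|t}$, given the non-linearity of the equation. Instead, a Newton-Raphson
method to calculate $\hat{\beta}_{t|t}$ is proposed, as follows:

$$\hat{\beta}_{t|t}^{(n+1)} = \hat{\beta}_{t|t}^{(n)} + \Sigma_{t|t}^{(n)} \, \nabla \ln G(\hat{\beta}_{t|t}^{(n)})$$

where $\Sigma_{t|t}^{(n)}$ is $\Sigma_{t|t}$ evaluated at $\hat{\beta}_{t|t}^{(n)}$, and we set the starting value
$\hat{\beta}_{t|t}^{(0)}$ to be $\hat{\beta}_{t|t-1}$.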
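
One time step of the recursion in Example 2, using the Newton-Raphson iteration of Comment 2, can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function names, the fixed iteration cap and the convergence tolerance are hypothetical, and the system noise covariance is taken as known.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))


def posterior_update(beta_prev, Sigma_prev, Omega, Z_s, M_s, n_iter=25, tol=1e-10):
    """Return the posterior mode and covariance of beta_t for one time step.

    beta_prev, Sigma_prev  posterior mode and covariance from time t-1
    Omega                  system noise covariance (assumed known)
    Z_s, M_s               training covariates (n x p) and 0/1 labels (n,)
    """
    # Prediction step: equations (10) and (11).
    beta_pred = beta_prev
    Sigma_pred = Sigma_prev + Omega
    Sigma_pred_inv = np.linalg.inv(Sigma_pred)

    beta = beta_pred.copy()  # starting value, as in Comment 2
    for _ in range(n_iter):
        p = sigmoid(Z_s @ beta)
        # Gradient of the log posterior: equation (12).
        grad = Z_s.T @ (M_s - p) - Sigma_pred_inv @ (beta - beta_pred)
        # Negative Hessian, i.e. the inverse of Sigma_{t|t} at the current iterate.
        neg_hess = Z_s.T @ (Z_s * (p * (1.0 - p))[:, None]) + Sigma_pred_inv
        step = np.linalg.solve(neg_hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break

    p = sigmoid(Z_s @ beta)
    Sigma_post = np.linalg.inv(Z_s.T @ (Z_s * (p * (1.0 - p))[:, None]) + Sigma_pred_inv)
    return beta, Sigma_post
```

At convergence the gradient in (12) is zero, so the returned mode should satisfy the fixed-point relation (7). Solving the linear system at each iteration, rather than forming the inverse explicitly, is a numerical convenience and not something the paper prescribes.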


Comment 3. The results in this Section can be readily extended to multinomial
logistic regression models (see Czepiel, 2002).


6.3 Analysis of continuous Big Data 


In Big Data scenarios where $M_U$ can be modelled as a continuous variable, and where
the data, $M_{so}$, are regularly observed, Tam (1987) extended the model
$\beta \mid \beta_0, \Omega \sim N(\beta_0, \Omega)$ in Example 1 to $\beta_t - \beta_{t-1} \mid \Omega_t \sim N(0, \Omega_t)$, where t denotes time.
The above models form a State-Space model, for which Kalman (1960) provided the best
linear unbiased predictor for $\beta_t$ given $D^{(t)}$. In this Section, we extend Puza (2013) to
provide the predictive distribution, $[M_{tC} \mid D^{(t)}]$.
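
For the linear Gaussian case, one step of the state-space recursion can be sketched in two algebraically equivalent ways: the gain (innovation) form usually associated with the Kalman filter, and the information form that combines prior and data precisions. The code is an illustrative sketch; the function names are hypothetical, and the observation model is assumed to be normal with mean `Z_s @ beta` and known covariance `Sigma_ss` for the observed units.

```python
import numpy as np


def kalman_step(beta_prev, Sigma_prev, Omega, Z_s, Sigma_ss, M_s):
    """Gain form: predict, then correct using the innovation."""
    beta_pred = beta_prev                  # random-walk state equation
    Sigma_pred = Sigma_prev + Omega
    innovation = M_s - Z_s @ beta_pred
    S = Z_s @ Sigma_pred @ Z_s.T + Sigma_ss
    gain = Sigma_pred @ Z_s.T @ np.linalg.inv(S)
    beta_post = beta_pred + gain @ innovation
    Sigma_post = Sigma_pred - gain @ Z_s @ Sigma_pred
    return beta_post, Sigma_post


def information_step(beta_prev, Sigma_prev, Omega, Z_s, Sigma_ss, M_s):
    """Information form: add prior precision and data precision."""
    beta_pred = beta_prev
    Sigma_pred_inv = np.linalg.inv(Sigma_prev + Omega)
    Sigma_ss_inv = np.linalg.inv(Sigma_ss)
    Sigma_post = np.linalg.inv(Sigma_pred_inv + Z_s.T @ Sigma_ss_inv @ Z_s)
    beta_post = Sigma_post @ (Sigma_pred_inv @ beta_pred + Z_s.T @ Sigma_ss_inv @ M_s)
    return beta_post, Sigma_post
```

The gain form inverts a matrix whose size is the number of observed units, whereas the information form inverts matrices of the size of the coefficient vector (plus the observation covariance), so which is cheaper depends on the relative sizes; this trade-off is a general remark and not a claim made in the paper.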


Example 3 


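Using the notation of Section 6.2, assume $M_{tU} = M_{tB}$, $Z_{tBo} = Z_{tB}$ and the sampling
and censoring processes ($\tilde{I}$, $I$, $\tilde{R}$ and $R$) are ignorable. Under the models:

$$
\begin{aligned}
M_{tB} \mid Z_{tB}, \beta_t &\sim N(Z_{tB} \beta_t, \Sigma_t) \\
\beta_t &= \beta_{t-1} + \varepsilon_t, \qquad \beta_{t-1} \perp Z_t \\
\varepsilon_t &\sim \text{independent } N(0, \Omega_t), \qquad \varepsilon_t \perp D^{(t-1)}
\end{aligned}
$$

for $i = 1, \ldots, b$ and $t = 1, \ldots, \tau$, and where $\Omega_t$ is assumed to be known for every t, to
simplify the illustration, then

$$[M_{tC} \mid D^{(t)}] = N(\mu_t, \Psi_t)$$

where

$$
\begin{aligned}
\mu_t &= Z_{tr} \hat{\beta}_{t|t} + \Sigma_{trs} \Sigma_{tss}^{-1} (M_{ts} - Z_{ts} \hat{\beta}_{t|t}), \\
\Psi_t &= \Sigma_{trr} - \Sigma_{trs} \Sigma_{tss}^{-1} \Sigma_{tsr}
          + (Z_{tr} - \Sigma_{trs} \Sigma_{tss}^{-1} Z_{ts}) \, \Sigma_{t|t} \, (Z_{tr} - \Sigma_{trs} \Sigma_{tss}^{-1} Z_{ts})', \\
\hat{\beta}_{t|t} &= \Sigma_{t|t} \left( \Sigma_{t|t-1}^{-1} \hat{\beta}_{t|t-1} + Z_{ts}' \Sigma_{tss}^{-1} M_{ts} \right), \qquad (13) \\
\Sigma_{t|t} &= \left( \Sigma_{t|t-1}^{-1} + Z_{ts}' \Sigma_{tss}^{-1} Z_{ts} \right)^{-1}.
\end{aligned}
$$

In addition, the posterior mean of $1_{tC}' M_{tC} + 1_{ts}' M_{ts}$ is $1_{tC}' \mu_t + 1_{ts}' M_{ts}$, and the posterior
variance is $1_{tC}' \Psi_t 1_{tC}$.

Proof of these results follows from observing that

• $[\beta_t \mid Z_t, D^{(t-1)}]$ is $N(\hat{\beta}_{t|t-1}, \Sigma_{t|t-1})$ from (9), (10) and (11), and by replacing
$\beta_0$ by $\hat{\beta}_{t|t-1}$ and $\Omega$ by $\Sigma_{t|t-1}$ in the results of Example 1; and likewise,

• $\hat{\beta} = \hat{\beta}_{t|t}$ and $D = \Sigma_{t|t}$.

Now rewriting

$$
\begin{aligned}
&\Sigma_{t|t} \left( \Sigma_{t|t-1}^{-1} \hat{\beta}_{t|t-1} + Z_{ts}' \Sigma_{tss}^{-1} M_{ts} \right) \\
&\quad = \Sigma_{t|t} \left\{ \left( \Sigma_{t|t-1}^{-1} + Z_{ts}' \Sigma_{tss}^{-1} Z_{ts} \right) \hat{\beta}_{t|t-1}
      + Z_{ts}' \Sigma_{tss}^{-1} M_{ts} - Z_{ts}' \Sigma_{tss}^{-1} Z_{ts} \hat{\beta}_{t|t-1} \right\},
\end{aligned}
$$

(13) can be rewritten in the more familiar form:

$$\hat{\beta}_{t|t} = \hat{\beta}_{t|t-1} + \Sigma_{t|t} Z_{ts}' \Sigma_{tss}^{-1} (M_{ts} - Z_{ts} \hat{\beta}_{t|t-1}).$$

The results extended those derived in Tam (1987) and Tam (1988, Chapter 5) which
treated $\Sigma_t$ as a diagonal matrix.


7. PRIVACY AND PUBLIC TRUST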


The privacy landscape is fundamentally changed by the emergence of Big Data. There 
is an obvious contention between the systematic exploitation of Big Data sources for 
better decision-making across government, and the acknowledged need to establish 
and maintain public trust in the use of personal information by government agencies. 
NSOs’ operations are governed by, and their authority to undertake collections 
enshrined in, statistical legislation. This sets the ground rules for how such data sets 
can be acquired, combined, protected, shared, exposed, analysed and retained. The 
legislation and associated policy framework is designed to promote trust and privacy,
and Big Data sources will further test our decision-making in adherence to the
framework.


A significant unresolved issue is the threat of disclosure through data accumulation. 
Every individual is a unique mosaic of publicly visible characteristics and private 
information. In a data rich world, distinct pieces of data that may not pose a privacy 
risk when released independently are likely to reveal personal information when they 
are combined — a situation referred to as the “mosaic effect”. The use of Big Data 
greatly amplifies the mosaic effect because large rich data sets typically contain many 
visible characteristics, and so individually or in composition may enable spontaneous 
recognition of individuals and the consequential disclosure of their private 
information. This will be a significant issue when disseminating microdata sets from 
Big Data sources. 
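
The mosaic effect can be made concrete by counting how many records become unique as more visible characteristics are combined. The sketch below is purely illustrative: the records are invented toy data, and the function name `unique_share` and the field names are assumptions rather than anything used by the ABS.

```python
from collections import Counter


def unique_share(records, keys):
    """Share of records whose combination of `keys` occurs exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Invented toy records: no single field identifies anyone on its own.
records = [
    {"postcode": "2600", "age_band": "30-39", "occupation": "nurse"},
    {"postcode": "2600", "age_band": "30-39", "occupation": "teacher"},
    {"postcode": "2600", "age_band": "40-49", "occupation": "nurse"},
    {"postcode": "2601", "age_band": "30-39", "occupation": "nurse"},
    {"postcode": "2601", "age_band": "40-49", "occupation": "teacher"},
    {"postcode": "2601", "age_band": "40-49", "occupation": "teacher"},
]
```

With these six toy records, no record is unique on postcode alone, two of the six are unique on postcode and age band together, and four of the six are unique once occupation is added. A rich data set with many such characteristics pushes this share towards one, which is why disclosure risk has to be assessed on combinations of variables rather than on each variable separately.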
8. DATA OWNERSHIP AND ACCESS


Data ownership and access is a key issue for NSOs and one where there is a general
lack of legislation and a supporting framework. The challenge is to unlock public
good from privately collected data whilst protecting the commercial interests of the 
data custodians. 


In many cases, commercial value is placed on primary and derived non-government 
data sets by their owners, since either the provision of such data is the basis of their 
business, or its possession is a significant element of competitive advantage. This 
raises the issue of how the NSO might acquire commercially valuable or sensitive data 
for statistical production, particularly if the statistics compete directly with information 
products created by the data owner or they compromise its market position. This 
issue is made more complex by the fact that there may be several parties with some 
form of commercial right in relation to a data set, either through ownership, 
possession or licensing arrangements. 


Much Web content is also unstructured and ungoverned — the metadata describing its 
usage and provenance (origin, derivation, history, custody, and context) are either 
incomplete or incongruous. Indeed, the long-term reliability of Big Data sources may 
be an issue for ongoing statistical production. Reputable statistics for policy making 
and service evaluation are generally required for extended periods of time, often many 
years. However, large data sets from dynamic networks are volatile — the data sources 
may change in character or disappear over time. This transience of data streams and 
sources undermines the reliability of statistical production and publication of 
meaningful time series.


9. COMPUTATIONAL EFFICACY


The exploitation of Big Data will have a significant impact on the ICT resource 
demands of data acquisition, storage, processing, integration, and analysis. Existing 
computational models for the most common statistical problems in the typical NSO 
scale very poorly for the number, diversity and volatility of data elements, attributes 
and linkages associated with Big Data sources. 


In particular, traditional relational database approaches are not sufficiently flexible for 
handling dynamic multiply-structured data sets in a computationally efficient way, and 
the execution of complex statistical algorithms at the scale of Big Data problems is 
likely to exceed the memory and processor resources of existing platforms. For 
example, probabilistic data linking under the Fellegi-Sunter model (Fellegi and Sunter, 
1969) is generally treated as a constrained Maximum Likelihood problem using 
simplex-based algorithms. The complexity of this problem is at least O(N*), which 
cannot be solved with existing computing resources when the size of the data set N is 
at the scale of Big Data. 
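
Part of the scale problem is visible even before any estimation takes place: comparing every record in one file with every record in another produces a number of candidate pairs equal to the product of the file sizes. Under the Fellegi-Sunter model each pair is scored by summing log-likelihood ratios over the compared fields. The sketch below shows only that scoring step and the pair count. It assumes the m- and u-probabilities (the probabilities of agreement among true matches and among non-matches) are already known, it does not reproduce the constrained Maximum Likelihood estimation mentioned above, and the function names are hypothetical.

```python
import math


def match_weight(agreements, m_probs, u_probs):
    """Fellegi-Sunter composite weight for one candidate record pair.

    agreements  one boolean per compared field (True if the field agrees)
    m_probs     P(field agrees | pair is a true match), assumed known
    u_probs     P(field agrees | pair is not a match), assumed known
    """
    weight = 0.0
    for agree, m, u in zip(agreements, m_probs, u_probs):
        if agree:
            weight += math.log2(m / u)
        else:
            weight += math.log2((1.0 - m) / (1.0 - u))
    return weight


def candidate_pairs(n_a, n_b):
    """Number of record pairs when every record in A is compared with every record in B."""
    return n_a * n_b
```

Two files of one million records each already imply a trillion candidate pairs, which is why blocking, approximation and distributed approaches of the kind discussed in this section become necessary well before the estimation step is reached.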


One possible approach is to outsource the analytics to the data owner. Statistics New 
Zealand is looking to do this with scanner data, as the data owner has the necessary 
computing infrastructure and performing the analysis where the data is stored is 
cheaper and easier. An added and important benefit of this approach is that the data 
owner does not need to share the underlying data, which may be very sensitive. A 
joint effort by methodologists and technologists is needed to develop techniques for 
reducing data volume and complexity while preserving statistical validity, and for 
improving algorithmic tractability and efficiency. This will involve explicitly recasting 
existing problems into a form that is better suited for distributed computing 
approaches, making greater use of approximate techniques, and favouring heuristic 
predictive models in the appropriate circumstances.


10. TECHNOLOGY INFRASTRUCTURE


Big Data technology has emerged from the extreme scale of Internet processing and 
progressively been applied to a growing range of business domains in the last decade. 
Industry supported open source technology developments have rapidly matured to 
the point where ‘enterprise class’ processing — in conjunction with traditional 
processing technologies — provides a stronger integrated set of technology options. 
Stand-alone and ‘point’ Big Data solutions are diminishing as they are integrated into 
wider solution architectures. Most established technology suppliers now include Big 
Data technology as part of their product portfolio. Big Data infrastructure and tools 
are evolving and there will continue to be proprietary and point solutions. 


Big Data processing also requires new types of data representation (semantic data, 
graph database), inference (AI-based analytical techniques in conjunction with robust
statistical analysis), visualisation (for complex network relationships), analytical 
languages (such as R and SAS), and the use of scale-out commodity hardware. A 
number of these technologies have value when applied to ‘traditional’ processing and 
analysis.


11. ABS INITIATIVES ON BIG DATA


As Australia’s centralised national statistical agency, the ABS provides official statistics 
on a wide range of social, demographic, economic and environmental issues to 
encourage informed decision making, research and discussion within governments 
and the community. The principal legislation determining the functions and 
responsibilities of the ABS are the Australian Bureau of Statistics Act 1975 and the 
Census and Statistics Act 1905. 


A number of initiatives are being progressed to build future capability in the 
exploitation of Big Data sources and to position the ABS nationally and internationally 
as a leading agency in advanced data analytics. 


11.1 A Big Data strategy 


To position the ABS to harness Big Data opportunities, a Big Data Strategy paper 
(ABS, 2014) has been developed and approved by ABS senior management. The 
objectives of the Strategy are to build an integrated multifaceted capability for 
systematically exploiting the potential value of Big Data for official statistics. 


This capability comprises: 


• A skilled workforce able to interpret information needs and communicate the
insights gleaned from rich data;

• Advanced methods, tools and infrastructure to represent, store, manipulate,
integrate and analyse large, complex data sets;

• A diverse pool of government, private and open data sources available for statistical
purposes;

• Safe and appropriate public access to microdata sets and statistical solutions derived
from an array of data sources; and

• Strong multidisciplinary partnerships across government, industry, academia and
the statistical community.


11.2 Big Data Flagship Project 


The ABS Big Data Flagship Project — an initiative led by ABS methodologists — is 
intended to coordinate research and development (R&D) effort that will build a sound 
methodological foundation for the mainstream use of Big Data in statistical 
production and analysis. The desired outcomes of the project are to: 


• Promote a greater understanding of Big Data concepts, opportunities, practicalities
and challenges within the ABS;

• Encourage methodological rigour in the use of different sources of Big Data for
statistical production;

• Build a seminal capability in exploring, combining, visualising and analysing large,
complex and volatile data sets;

• Cultivate strong links to networks of Big Data experts in government, industry,
academia, and the international statistical community; and

• Enhance national and international standing for the ABS in Big Data inference.


The project has scheduled the following work packages: 


• Environmental Scanning and Opportunity Analysis: survey the operational
environment for Big Data sources of potential use in statistical production, and
identify business problems and 'pain-points' that can be addressed through
non-traditional data sources and analytical methods;

• Remote Sensing for Agricultural Statistics: investigate the use of satellite sensor
data for the production of agricultural statistics such as land use, crop type and crop
yield;

• Mobile Device Location Data for Population Mobility: investigate the use of mobile
device location-based services and/or global positioning for measuring population
mobility;

• Predictive Modelling of Unemployment: investigate the application of machine
learning to the construction of predictive small-area models of unemployment from
linked survey and administrative data;

• Visualisation for Exploratory Data Analysis: investigate advanced visualisation
techniques for the exploratory analysis of complex multidimensional data sets;

• Analysis of Multiple Connections in Linked Data: investigate Linked Data
techniques for analysing multiply connected data entities at different levels of
granularity;

• Predictive Modelling of Survey Non-Response: investigate the application of
machine learning to the construction of predictive small-domain models of
non-response behaviour using paradata from past surveys; and

• Automated Content Analysis of Complex Administrative Data: investigate
techniques for the automated extraction and resolution of concepts, entities and
facts from multi-structured content in administrative data sets.


11.3 Participation in the Australian Public Service (APS) data analytics initiatives


The ABS is a member of the Leadership Group of the APS Data Analytics Centre of
Excellence (APS DACoE), which was formed in late 2013, to build collaborative
capability across Government in the use of advanced data analytics by:


• sharing technical and business knowledge, tools and techniques, skills development
and standards for operating such as protocols for privacy and information
management practices;

• exploring and identifying opportunities to add business value through the use of
analytics, considering: developments in information and knowledge management
practices; industry developments in analytics technology, infrastructure and
software; accreditation and professional development of analytics professionals for
public-sector employment; and

• identifying and providing advice to the Chief Information Officers Committee on
common issues and concerns affecting the analytics capability; barriers to the
effective use of Big Data; Big Data pilot projects; other actions as outlined in the APS
Big Data Strategy.


The DACoE has developed a best practice guide for Big Data/Big Analytics, which 
provides a whole-of-Government strategy on the use and implementation of Big Data 
amongst Australian Government agencies. It is currently compiling an inventory of 
business problems across government and the analytical methods and data sets that 
are being employed to solve them. The DACoE is also seeking to shape public sector
engagement, recruitment and retention practices for data analysis professionals. 


11.4 Collaboration with research community 


ABS is establishing a collaboration network with leading Australian researchers in the 
field of data analytics to advance the research objectives of the Big Data Flagship 
Project. In particular, the project will draw on the expertise of the Image Processing 
and Remote Sensing Group at the Canberra campus of the University of New South 
Wales and the Advanced Analytics Institute at the University of Technology Sydney for areas
such as satellite sensing and predictive modelling. 


ABS is also an industry partner of a Centre of Excellence for Mathematical and
Statistical Frontiers of Big Data, Big Models and New Insights, headed by the eminent
mathematical statistician, Professor Peter Hall, of University of Melbourne. The Centre,
comprising a multi-disciplinary team of statisticians, mathematicians, computational
specialists and computer scientists, is funded by the Australian Research Council for a
total of A$20 million over seven years. As an industry partner, the ABS was able to
influence the Centre’s research program to include research themes such as data
fusion and integration, which are of significant interest to the ABS.


12. CONCLUDING REMARKS


Official statisticians have been dealing with a diversity of data sources for decades. 
Whilst new sources from Big Data provide an opportunity for official statisticians to 
deliver a more efficient and effective statistical service, in deciding whether to 
embrace a particular Big Data source, we argue that there are a number of threshold 
considerations, namely, business need, business benefit, and the validity of using the 
source for official statistics for finite population inferences, or analytic inferences. The 
Data Quality Framework is useful in assessing the quality of the Big Data sources, and
for assessing the fitness for purpose of Big Data.


This paper also provides a Bayesian framework for Big Data inferences, based on 
conceptualised transformation, sampling and censoring processes applied to the Big 
Data measurements. Proper inference will require modelling of all three processes, 
which can be very complex, if at all possible. However, in situations where ignorability 
conditions are fulfilled, inference can be made on the Big Data measurements as if 
they are acquired from a random sample. 


Until recently, ABS progress in the Big Data domain has consisted primarily of reviewing
and monitoring industry developments while contributing to external strategic and
concept development activities. The ABS Big Data Flagship Project provides the
opportunity to gain practical experience in assessing the business, statistical, technical,
computational and other issues outlined in this paper. ABS participation in national
and international activities on Big Data will also help it share experience and
knowledge, and collaboration with academics will help the ABS better acquire the
capability to address business problems using Big Data as a part of the solution.
Finally, these and related initiatives have been summarised in an ABS Big Data Strategy
paper (ABS, 2014).